Summary of the previous lecture

In our previous joint session, we introduced some fundamental notions of the Python language. Let's review some of them!

Libraries and import statements


In [39]:
from idai_journals import nlp as dainlp
import re
from treetagger import TreeTagger

In [40]:
from nltk.tag import StanfordNERTagger
from nltk.chunk.util import tree2conlltags
from nltk.chunk import RegexpParser
from nltk.tree import Tree

Data types


In [ ]:
#integers and floats
3 + 0.5
#strings
"hello"
#Booleans
True

Data collections (and variables)


In [41]:
#lists (can also contain multiple different data types)
li  = ["Leipzig", "London", "Berlin", "Boston", 4, False]
#tuples (like lists, but immutable)
tu = ("tuple", "list", "dictionary")
#dictionaries (key : value pairs)
di = {"key" : "value", "other-key" : "second value"}
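
Once created, you access list and tuple elements by position (starting from 0) and dictionary values by key. A quick reminder, using the variables we just defined:

li[0]       # "Leipzig"
tu[-1]      # "dictionary" (negative indexes count from the end)
di["key"]   # "value"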

For loops and if statements


In [42]:
#home assignment: try to figure out what the if statement (line 2) does
for l in li:
    if isinstance(l, str):
        print(l)


Leipzig
London
Berlin
Boston

Functions


In [43]:
def printMe(message):
    print(message)
    
printMe("Hello, world!")
printMe("goodbye...")


Hello, world!
goodbye...

Handling exceptions


In [44]:
l = ["zero", "one", "two", "three"]
l[10]


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-44-9c6d14a570dc> in <module>()
      1 l = ["zero", "one", "two", "three"]
----> 2 l[10]

IndexError: list index out of range

In [45]:
try:
    l[10]
except IndexError:
    print("hey, your index is way too high!")


hey, your index is way too high!

A bonus: objects

Objects might be a bit complicated, but they're very important for understanding code written by other people, since most of the programs you'll find around are written using classes and objects. Oh, and the good news is... you've already met them!

What are "objects" in a programming language like Python? Well, I like to think about them as... magical, animated tools!

Say that you want to fetch water from a well (and maybe clean up some of the mess...). Well, the object-oriented approach to this task consists of creating one or more magic brooms that go and fetch the water for you! In order to create them, you have to conceptualize the broom in terms of:

  • the special features it has (e.g. number of buckets carried, speed...)
  • the actions that it can execute (fetch water, clean the floor)

That's it! In programming parlance, the features are called properties of the object; the actions are called methods.

When you want to build your own magic brooms you first create a sort of prototype for each of them (which is called the class of magic brooms); then you can go on and create as many brooms as you want...

Here's how to do it! (very simplified)


In [46]:
class MagicBroom():
    #this is called "constructor"; it's a special method
    def __init__(self, name, speed=20):
        self.name = name
        self.buckets = 2
        self.speed = speed

    def greet(self):
        print("Hello, my name is %s! What can I do for you?" % self.name)
    
    def fetchWater(self):
        if self.speed >= 20:
            print("Yes, sir! I'll be back in a sec!")
        else:
            print("Allright, but I am taking my time!")

In [48]:
mickey = MagicBroom("Mickey")
mickey.greet()


Hello, my name is Mickey! What can I do for you?

In [49]:
peter = MagicBroom("Peter", speed=5)

In [50]:
mickey.speed


Out[50]:
20

In [51]:
mickey.fetchWater()


Yes, sir! I'll be back in a sec!

In [52]:
peter.fetchWater()


Alright, but I am taking my time!

Regular expressions

How would you find all the numbers in this sentence?

The set of integers consists of zero (0), the positive natural numbers (1, 2, 3, …), also called whole numbers or counting numbers,[1][2] and their additive inverses (the negative integers, i.e., −1, −2, −3, …). This is often denoted by a boldface Z ("Z") or blackboard bold Z {\displaystyle \mathbb {Z} } \mathbb {Z} (Unicode U+2124 ℤ) standing for the German word Zahlen ([ˈtsaːlən], "numbers").[3][4] ℤ is a subset of the sets of rational and real numbers and, like the natural numbers, is countably infinite.


In [53]:
wiki = 'The set of integers consists of zero (0), the positive natural numbers (1, 2, 3, …), also called whole numbers or counting numbers,[1][2] and their additive inverses (the negative integers, i.e., −1, −2, −3, …). This is often denoted by a boldface Z ("Z") or blackboard bold Z {\displaystyle \mathbb {Z} } \mathbb {Z} (Unicode U+2124 ℤ) standing for the German word Zahlen ([ˈtsaːlən], "numbers").[3][4] ℤ is a subset of the sets of rational and real numbers and, like the natural numbers, is countably infinite.'

We'd need a way to tell our machine not to look for specific strings, but rather for classes of strings, i.e. using some sort of meta-character to catch a whole group of signs (e.g. the numbers); then we'd need to tell it to optionally include/exclude some other signs, or to catch the numbers only if they're not preceded/followed by other signs...

That's precisely what Regular Expressions do! They allow you to express a query as a string of metacharacters (or groups of metacharacters).

How do we use them in Python? First, we need to import a module from the Standard Library (i.e. it already comes with Python: no need to install external libraries)


In [55]:
import re

A cool feature of RegExp in Python is that you can create your complicated patterns as objects (and assign them to variables)! That's right, RegExp patterns are your magic brooms...


In [56]:
#here is one to catch all numbers
reg = re.compile(r'[0-9]+') #or: r'\d+'
type(reg)


Out[56]:
_sre.SRE_Pattern

The Pattern object has a number of interesting methods to search for the pattern in a text and to replace its matches. Generally, you call them with the text to be searched as an argument. For instance, findall returns all matches as a list


In [57]:
reg.findall(wiki)


Out[57]:
['0', '1', '2', '3', '1', '2', '1', '2', '3', '2124', '3', '4']

Kind of a sloppy job we did! The negative numbers are not captured as negative; the footnote references (e.g. [1], [4]) are also captured, and we don't want them... We can do better. Let's improve our pattern so that it includes the '−' sign (if present) and gets rid of the footnotes, using a negative lookbehind ((?<!\[)) and a negative lookahead ((?!\])) to skip anything enclosed in square brackets


In [58]:
reg = re.compile(r'(?<!\[)−?\d+(?!\])') # the 'r' is there to make sure that we don't have to "escape the escape" sign (\)
reg.findall(wiki)


Out[58]:
['0', '1', '2', '3', '−1', '−2', '−3', '2124']
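
Besides findall, the Pattern object also offers, among others, search (which returns the first match as a match object) and sub (which replaces every match). A minimal illustration, where the replacement string "NUM" is just an arbitrary placeholder:

m = reg.search(wiki)     # match object for the first number found
m.group()                # '0'
reg.sub("NUM", wiki)     # a copy of the text with every matched number replaced by "NUM"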

Now it's time to go back to our (Named) Entity recognition and extraction task. We're going to use RegExp patterns and syntax quite a few times from now on...

Extracting dates and persons from texts

As Matteo said last time, the concept of "named entity" is domain- and task-specific. While a person's or a place's name will more or less always fall under the definition, in some contexts of information extraction people might be interested in other kinds of real-life "entities", such as time references (months, days, dates) or museum objects, which are irrelevant in other contexts.

In this exercise, we are going to expand on what Matteo did last time with proper names in Latin and look at two specific classes of "entities" mentioned in a modern scientific text about ancient history: dates and persons.

A modern text in English

First, let's grab a text.

We will be working with an English article on Roman history. The article is: Frederik Juliaan Vervaet, The Praetorian Proconsuls of the Roman Republic (211–52 BCE). A Constitutional Survey, Chiron 42 (2012): 45–96.

Let's start by loading the text and inspecting the first 1,000 characters (we'll be working with just the first 10k words)


In [59]:
with open("data/txt/article446_10k.txt") as f:
    txt = f.read()

In [60]:
txt[:1000]


Out[60]:
'\x0cThe Praetorian Proconsuls of the Roman Republic 45 \nFREDERIK JULIAAN VERVAET \nThe Praetorian Proconsuls of the Roman Republic (211–52 BCE). \nA Constitutional Survey \n1. Introduction \nThe republican administrative procedure of sending out praetors with consular imperium is reasonably well-known but little understood. To the best of my knowledge, \nnot a single study or book chapter has been devoted exclusively to a gubernatorial \npractice that rapidly gained importance from the Second Punic War. This bipartite \nstudy aims at bridging this remarkable gap. The first component of this inquiry endeavours to offer an overall constitutional survey of the institutional phenomenon of \nthe praetura pro consule by discussing its origins, nature and historical development. \nThe second part is conducted by F. Hurlet and scrutinizes this practice as recorded \nin the fasti of Africa, Sicily and Corsica-Sardinia. After highlighting the significance of \nthe Metilian Law from 217 BCE as a precedent, the'

Part-Of-Speech (POS) and Named-Entity (NE) Tagging

Most of the time, POS tagging is the precondition for any other advanced operation you may want to perform on a text.

As we did with Matteo last time, by "tagging" we mean the coupling of each word with a tag that describes some property of the word itself. Part-of-speech tags define what word class (e.g. "verb", or "proper noun") a text token belongs to.

There are several tagsets in use for each language, and several software tools (POS taggers) that can tag your text automatically. One of the most widely used is TreeTagger, which has pretrained classifiers for many languages.

Let's run it from Python, using one of the few "wrappers" available


In [61]:
#first we load the library
from treetagger import TreeTagger

In [62]:
#That's right! we start by creating a Tagger "magic broom" (a Tagger object)
tt = TreeTagger(language="english")

#then we tag our text
tagged = tt.tag(txt)

In [63]:
tagged[:20]


Out[63]:
[['The', 'DT', 'the'],
 ['Praetorian', 'JJ', 'Praetorian'],
 ['Proconsuls', 'NNS', 'proconsul'],
 ['of', 'IN', 'of'],
 ['the', 'DT', 'the'],
 ['Roman', 'NP', 'Roman'],
 ['Republic', 'NP', 'Republic'],
 ['45', 'CD', '@card@'],
 ['FREDERIK', 'NP', 'Frederik'],
 ['JULIAAN', 'NP', '<unknown>'],
 ['VERVAET', 'NP', '<unknown>'],
 ['The', 'DT', 'the'],
 ['Praetorian', 'JJ', 'Praetorian'],
 ['Proconsuls', 'NNS', 'proconsul'],
 ['of', 'IN', 'of'],
 ['the', 'DT', 'the'],
 ['Roman', 'NP', 'Roman'],
 ['Republic', 'NP', 'Republic'],
 ['(', '(', '('],
 ['211–52', 'JJ', '<unknown>']]

Named Entity Recognition (using a tool like the Stanford NER that we saw in our last lecture) is also a way of tagging the text, this time using information not about the word class but about a different level of classification (place, person, organization or none of the above).

Let's do this too


In [64]:
#first, we define the path to the English classifier for Stanford NER
english_classifier = 'english.all.3class.distsim.crf.ser.gz'
twords = [w[0] for w in tagged]

In [65]:
#then... guess what? Yes, we create a NER-tagger Magic Broom ;-)
from nltk.tag import StanfordNERTagger

ner_tagger = StanfordNERTagger(english_classifier)
ners = ner_tagger.tag(twords)

In [66]:
#not very pretty...
ners[:20]


Out[66]:
[('The', 'O'),
 ('Praetorian', 'O'),
 ('Proconsuls', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Roman', 'LOCATION'),
 ('Republic', 'LOCATION'),
 ('45', 'O'),
 ('FREDERIK', 'O'),
 ('JULIAAN', 'O'),
 ('VERVAET', 'O'),
 ('The', 'O'),
 ('Praetorian', 'O'),
 ('Proconsuls', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Roman', 'LOCATION'),
 ('Republic', 'LOCATION'),
 ('(', 'O'),
 ('211–52', 'O')]

Chunking

As we saw, when we analyze a text we proceed word by word (more exactly: token by token). However, Named Entities (now including dates) often span more than one token. The task of sub-dividing a section of text into phrases and/or meaningful constituents (which may include one or more text tokens) is called chunking.

Take the sentence "We saw the yellow dog": the tokens are [We, saw, the, yellow, dog]. Two Noun Phrases (NP) can be chunked:

  • "we" (1 token)
  • "the yellow dog" (3 tokens)

The IOB notation that Matteo introduced last time is a popular way to store the information about chunks in a word-by-word format. In the case of "the yellow dog", we will have:

  • saw = not in a chunk --> O
  • the = beginning of the chunk --> B-NP
  • yellow = internal part of the chunk --> I-NP
  • dog = internal part of the chunk --> I-NP
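
Stored as (token, POS-tag, IOB-tag) triples, which is how NLTK represents IOB data, the annotated sentence would look roughly like this (the POS tags PRP, VBD, etc. are added here purely for illustration):

[('We', 'PRP', 'B-NP'),
 ('saw', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('yellow', 'JJ', 'I-NP'),
 ('dog', 'NN', 'I-NP')]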

The easiest method for chunking a sentence in Python is to use the information in the tags together with a RegExp-like syntax.

For example, if we have:

in/O New/LOCATION York/LOCATION City/LOCATION

We easily see that the 3 tokens tagged as LOCATION go together. We may thus write a grammar rule that chunks them together as a LOC:

LOC: {<LOCATION><LOCATION>*}

Which means: group into a chunk named LOC every token tagged as LOCATION, together with any further tokens tagged as LOCATION that may optionally follow.
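
Here is a minimal sketch of that rule in action on the toy example above, using NLTK's RegexpParser (which we'll meet again in a moment):

from nltk.chunk import RegexpParser

toy_chunker = RegexpParser(r'''
LOC: {<LOCATION><LOCATION>*}
''')
toy = [("in", "O"), ("New", "LOCATION"), ("York", "LOCATION"), ("City", "LOCATION")]
print(toy_chunker.parse(toy))
# (S in/O (LOC New/LOCATION York/LOCATION City/LOCATION))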

The same also goes for PERSONS and ORGANIZATIONS. We may even use RegExp syntax to be more tolerant and make room for annotation errors, e.g. in case the two tokens George Washington are wrongly tagged as PERSON and LOCATION.

Here's how I'd do it (it's not perfect at all but it should work in most cases)...


In [68]:
from nltk.chunk import RegexpParser

english_chunker = RegexpParser(r'''
LOC:
    {<LOCATION><(PERSON|LOCATION|MISC|ORGANIZATION)>*}
''')

Let's see it in action with the first few words


In [69]:
tree = english_chunker.parse(ners[:20])
print(tree)


(S
  The/O
  Praetorian/O
  Proconsuls/O
  of/O
  the/O
  (LOC Roman/LOCATION Republic/LOCATION)
  45/O
  FREDERIK/O
  JULIAAN/O
  VERVAET/O
  The/O
  Praetorian/O
  Proconsuls/O
  of/O
  the/O
  (LOC Roman/LOCATION Republic/LOCATION)
  (/O
  211–52/O)

Well... OK, "Roman Republic" is not a location, but at least the chunking is exactly what we wanted to have, right?

Export to IOB notation

OK, but now how do we convert this to the IOB notation?

Luckily, there's a ready-made function in a module from the NLTK library! Let's load and use it

(just in case, there is also a function that does the reverse: from IOB to tree)
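
The reverse function is conlltags2tree, from the same nltk.chunk.util module. A minimal sketch:

from nltk.chunk.util import conlltags2tree

#takes a list of (token, tag, iob) triples and rebuilds the chunk tree
conlltags2tree([('Roman', 'LOCATION', 'B-LOC'), ('Republic', 'LOCATION', 'I-LOC')])
# Tree('S', [Tree('LOC', [('Roman', 'LOCATION'), ('Republic', 'LOCATION')])])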


In [70]:
from nltk.chunk.util import tree2conlltags

In [72]:
iobs = tree2conlltags(tree)

In [73]:
iobs


Out[73]:
[('The', 'O', 'O'),
 ('Praetorian', 'O', 'O'),
 ('Proconsuls', 'O', 'O'),
 ('of', 'O', 'O'),
 ('the', 'O', 'O'),
 ('Roman', 'LOCATION', 'B-LOC'),
 ('Republic', 'LOCATION', 'I-LOC'),
 ('45', 'O', 'O'),
 ('FREDERIK', 'O', 'O'),
 ('JULIAAN', 'O', 'O'),
 ('VERVAET', 'O', 'O'),
 ('The', 'O', 'O'),
 ('Praetorian', 'O', 'O'),
 ('Proconsuls', 'O', 'O'),
 ('of', 'O', 'O'),
 ('the', 'O', 'O'),
 ('Roman', 'LOCATION', 'B-LOC'),
 ('Republic', 'LOCATION', 'I-LOC'),
 ('(', 'O', 'O'),
 ('211–52', 'O', 'O')]

Regex tagger

Now, to go back to our original task, how do we use all this to annotate the dates and export them to IOB?

Dates are often just numbers (e.g. "2017"); sometimes they come in more complex formats like: "14 September 2017" or "14-09-2017".

One very simple solution to find and annotate them with a chunking notation might be to tag the tokens of our text with a very simple custom tagset that we design for dates. We assign "O" to all tokens, except for the numbers (which we tag "CD") and some selected time formats or expressions, like the months of the year or the sequence number-number, which we tag "Date".

In order to do this, we need:

  • regular expression syntax
  • a tagger that works with RegExp patterns

A module of NLTK provides exactly that: a tagger that works with RegExp syntax


In [74]:
from nltk.tag import RegexpTagger

In [87]:
#here is our list of patterns
patterns = [
    (r'\d+$', 'CD'),
    (r'\d+[-–]\d+$', "Date"),
    (r'\d{1,2}[-\.\/]\d{1,2}[-\.\/]\d{2,4}', "Date"),
    (r'January|February|March|April|May|June|July|August|September|October|November|December', "Date"),
    (r'\d{4}$', "Date"),
    (r'BCE|BC|AD', "Date"),
    (r'.*', "O")
]

In [88]:
#Our RegexpTagger magic broom! We initialize it with our pattern list
tagger = RegexpTagger(patterns)

In [77]:
#let's test it with a trivial example
tagger.tag("I was born on September 14 , or 14-09".split(" "))


Out[77]:
[('I', 'O'),
 ('was', 'O'),
 ('born', 'O'),
 ('on', 'O'),
 ('September', 'Date'),
 ('14', 'CD'),
 (',', 'O'),
 ('or', 'O'),
 ('14-09', 'Date')]

Now let's see it in action on the real stuff


In [89]:
reg_tag = tagger.tag(twords)

In [90]:
reg_tag[:50]


Out[90]:
[('The', 'O'),
 ('Praetorian', 'O'),
 ('Proconsuls', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Roman', 'O'),
 ('Republic', 'O'),
 ('45', 'CD'),
 ('FREDERIK', 'O'),
 ('JULIAAN', 'O'),
 ('VERVAET', 'O'),
 ('The', 'O'),
 ('Praetorian', 'O'),
 ('Proconsuls', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Roman', 'O'),
 ('Republic', 'O'),
 ('(', 'O'),
 ('211–52', 'Date'),
 ('BCE', 'Date'),
 (')', 'O'),
 ('.', 'O'),
 ('A', 'O'),
 ('Constitutional', 'O'),
 ('Survey', 'O'),
 ('1', 'CD'),
 ('.', 'O'),
 ('Introduction', 'O'),
 ('The', 'O'),
 ('republican', 'O'),
 ('administrative', 'O'),
 ('procedure', 'O'),
 ('of', 'O'),
 ('sending', 'O'),
 ('out', 'O'),
 ('praetors', 'O'),
 ('with', 'O'),
 ('consular', 'O'),
 ('imperium', 'O'),
 ('is', 'O'),
 ('reasonably', 'O'),
 ('well-known', 'O'),
 ('but', 'O'),
 ('little', 'O'),
 ('understood', 'O'),
 ('.', 'O'),
 ('To', 'O'),
 ('the', 'O'),
 ('best', 'O')]

Now we just need to chunk it and export it to IOB. Then we are ready to evaluate it against the manual annotation...

First, we have to define a chunker


In [91]:
date_chunker = RegexpParser(r'''
DATE:
    {<CD>*<Date><Date|CD>*}
DATE:
    {<CD>+}
''')

In [92]:
t = date_chunker.parse(reg_tag)

#we use that function to make sure that the tree is not too complex to be converted
flat = dainlp.flatten_tree(t)

In [93]:
iob_list = tree2conlltags(flat)

In [94]:
iob_list[:50]


Out[94]:
[('The', 'O', 'O'),
 ('Praetorian', 'O', 'O'),
 ('Proconsuls', 'O', 'O'),
 ('of', 'O', 'O'),
 ('the', 'O', 'O'),
 ('Roman', 'O', 'O'),
 ('Republic', 'O', 'O'),
 ('45', 'CD', 'B-DATE'),
 ('FREDERIK', 'O', 'O'),
 ('JULIAAN', 'O', 'O'),
 ('VERVAET', 'O', 'O'),
 ('The', 'O', 'O'),
 ('Praetorian', 'O', 'O'),
 ('Proconsuls', 'O', 'O'),
 ('of', 'O', 'O'),
 ('the', 'O', 'O'),
 ('Roman', 'O', 'O'),
 ('Republic', 'O', 'O'),
 ('(', 'O', 'O'),
 ('211–52', 'Date', 'B-DATE'),
 ('BCE', 'Date', 'I-DATE'),
 (')', 'O', 'O'),
 ('.', 'O', 'O'),
 ('A', 'O', 'O'),
 ('Constitutional', 'O', 'O'),
 ('Survey', 'O', 'O'),
 ('1', 'CD', 'B-DATE'),
 ('.', 'O', 'O'),
 ('Introduction', 'O', 'O'),
 ('The', 'O', 'O'),
 ('republican', 'O', 'O'),
 ('administrative', 'O', 'O'),
 ('procedure', 'O', 'O'),
 ('of', 'O', 'O'),
 ('sending', 'O', 'O'),
 ('out', 'O', 'O'),
 ('praetors', 'O', 'O'),
 ('with', 'O', 'O'),
 ('consular', 'O', 'O'),
 ('imperium', 'O', 'O'),
 ('is', 'O', 'O'),
 ('reasonably', 'O', 'O'),
 ('well-known', 'O', 'O'),
 ('but', 'O', 'O'),
 ('little', 'O', 'O'),
 ('understood', 'O', 'O'),
 ('.', 'O', 'O'),
 ('To', 'O', 'O'),
 ('the', 'O', 'O'),
 ('best', 'O', 'O')]

In [95]:
#then we can write it on an output file
with open("data/iob/article_446_date_aut.iob", "w") as out:
    for i in iob_list:
        out.write("\t".join(i)+"\n")

Exercise

In the practical exercise, you are asked to extract the person names from the same article that we used for dates. You will annotate them using the Stanford NER with the pre-trained classifier for English that comes with the software; extract the Person chunks; and evaluate the results against a gold standard.

Here is a summary of the steps that you will have to execute in order to solve the exercise:

  • load the file: data/txt/article446_10k.txt and read its content
  • annotate the Named Entities using Stanford NER
  • define an appropriate chunker for Persons
  • chunk the extracted Named Entities
  • convert the chunked Tree into IOB format
  • evaluate the IOB annotation using the appropriate functions
    • use the file: data/iob/article_446_person_GOLD.iob as gold standard
  • report the final evaluation metrics (precision, recall, F-score)

In [ ]:
#just remember that the path to the English pre-trained classifier for Stanford NER is
english_classifier = 'english.all.3class.distsim.crf.ser.gz'

For the evaluation of the accuracy of your classifier, you can adapt the following lines of code:

from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(
    gold_labels,
    auto_labels,
    average="micro",
    labels=["B-DATE", "I-DATE"])
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))

Things you'll need to change/provide:

  • list of positive labels (variable labels)
  • gold_labels: a list with the correct labels
  • auto_labels: a similar list with the labels output by your classifier.

NB: make sure that gold_labels and auto_labels have the same length, i.e. that the labels at position n in both lists refer to the same token.
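
For instance, assuming that both the gold file and your own output follow the same tab-separated format we wrote above (token, tag, IOB label, one token per line), a minimal sketch for building the two label lists could look like this (the path of the automatic file is just a hypothetical example):

def read_iob_labels(path):
    #read a tab-separated IOB file and return the list of IOB labels (third column)
    with open(path) as f:
        return [line.strip().split("\t")[2] for line in f if line.strip()]

gold_labels = read_iob_labels("data/iob/article_446_person_GOLD.iob")
auto_labels = read_iob_labels("data/iob/article_446_person_aut.iob")  # hypothetical path to the file you produced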